Conversation
|
Hey guys, since I switched to lorax and started contributing there a lot after the first license change, I'm happy to see this PR got opened. I'd be glad if you are open to some questions and discussion about it. I'd be happy to contribute here too.
|
hi @flozi00 thanks for the feedback! Can you share more about the lorax-style API? I see that in lorax you can specify the adapter via the
|
Yes, I mean the "adapter_id" inside "parameters" for the TGI API (as you did, I see now), and the "model" field in the OpenAI API :)
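To make the two request shapes being discussed concrete, here is a small sketch of the payloads (field names are the ones from this conversation; everything else, including the example adapter id, is illustrative):

```python
import json

# TGI-native API: the adapter goes in "parameters.adapter_id"
tgi_request = {
    "inputs": "What are 3 unique words that describe you?",
    "parameters": {"max_new_tokens": 40, "adapter_id": "predibase/customer_support"},
}

# OpenAI-compatible API: lorax overloads the "model" field with the adapter id
openai_request = {
    "model": "predibase/customer_support",
    "messages": [{"role": "user", "content": "What are 3 unique words that describe you?"}],
}

# Both are plain JSON bodies sent to the respective endpoints
print(json.dumps(tgi_request))
print(json.dumps(openai_request))
```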
|
update: This PR's implementation has been updated to align with the great work done by the lorax team. It reuses the same layers where possible and only diverges to accommodate TGI's recent updates/improvements; LoRA adapters are limited to loading at startup. The current changes allow weights to be loaded similarly to lorax, but there are still generation issues to resolve, along with other refactors.
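For readers unfamiliar with the underlying math, here is a minimal numpy sketch (not the actual TGI/lorax code) of what a LoRA adapter contributes: two low-rank matrices A (r x in) and B (out x r) whose product, scaled by alpha/r, is added to the base weight. The server keeps adapters separate so it can apply them per request, but the delta itself is just:

```python
import numpy as np

def apply_lora(base_weight, lora_a, lora_b, alpha):
    """Return the base weight with the low-rank LoRA delta added:
    W' = W + (alpha / r) * B @ A, where r is the LoRA rank."""
    r = lora_a.shape[0]
    scaling = alpha / r
    return base_weight + scaling * (lora_b @ lora_a)

# Toy shapes: out=4, in=6, rank r=2 (values are random, for illustration only)
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
A = rng.normal(size=(2, 6))
B = rng.normal(size=(4, 2))
W_merged = apply_lora(W, A, B, alpha=16)
print(W_merged.shape)  # (4, 6)
```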
|
Looks like you are successfully adopting the lorax code |
|
@flozi00 generation with loras is mostly stable, just focusing on the rebase then refactors now. And thank you 🙂 a review once the PR is ready would be super helpful! |
|
Thanks for the shoutout in the docs! It's quite interesting to see things come full circle, maybe we should chat about merging our projects. |
|
of course @tgaddair, thank you for the awesome work! That's an interesting idea, and we are always aiming to improve TGI. We appreciate any contributions/discussions about features that may be helpful to our users
I'd love to migrate to tgi again 👍 And of course I'll try to contribute here too @tgaddair
|
hi @xiadingZ in this PR lora adapters are loaded from the
once this initial lora work is merged we'll follow up with other improvements, such as easier ways to specify the lora path
Hi @drbh, I can try your method with a downloaded lora, but I have a lora adapter trained locally. It doesn't have a directory structure such as
I set HUGGINGFACE_HUB_CACHE as
server/text_generation_server/models/custom_modeling/flash_llama_modeling.py
|
Forgot to add: we probably want an integration test as well. |
danieldk left a comment
Thanks for all the changes! Looks ready to merge to me after the small nit that breaks CI is fixed.
|
@danieldk thanks for the review! I've fixed the nits and CI passes. Going to go ahead and merge based on your last approval |
* feat: first draft load multiple lora
* feat: load weights within layer and refactor lora pass
* fix: refactor and reduce lora math
* feat: baseline impl single request multi lora support
* feat: prefer lorax implementation and port loading logic
* fix: prefer adapter_data and refactors
* feat: perfer loraxs custom punica kernels and add mlp loras
* fix: adjust batch for bgmv
* fix: adjust adapter_segments logic when in batch
* fix: refactor and move changes to v3 proto
* fix: pass model_id for all flash causal lms
* fix: pass model_id for all causal and seq2seq lms
* fix: add model_id to model test
* feat: add lora support to mistral and refactors
* feat: prefer model id in request
* fix: include rust code for adapter id
* feat: bump launcher and add new lora docs
* feat: support base model generation and refactors
* fix: rename doc to retry ci build
* feat: support if vlm models
* fix: add adapter_data param and avoid missing layers
* fix: add adapter_data param to phi and neox
* fix: update all models forwards to include adapter_data
* fix: add model_id to IdeficsCausalLM
* Update lora.md Fixed a typo
* Update lora.md Fixing spam image
* fix: add lora kernel to dockerfile, support running without kernels and refactors
* fix: avoid dockerfile conflict
* fix: refactors and adjust flash llama lora logic
* fix: skip llama test due to CI issue (temp)
* fix: skip llama test CI (temp) 2
* fix: revert skips and prefer updated ci token for tests
* fix: refactors and helpful comments
* fix: add noop in TensorParallelAdapterRowLinear too
* fix: refactor and move shard_lora_weights logic
* fix: exit early if no adapter_data

Co-authored-by: Derek <datavistics@gmail.com>
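The "adapter_segments" commits refer to a common trick when batching requests that use different adapters: contiguous runs of tokens sharing an adapter are grouped into segments so a grouped (SGMV/BGMV-style) kernel can process each run at once. A hypothetical sketch of that grouping (function and variable names are illustrative, not the repo's API):

```python
def build_adapter_segments(token_adapter_ids):
    """Return (segment_starts, segment_adapter_ids) for a flattened batch.

    segment_starts carries one extra trailing entry (the total length),
    so segment i covers token positions [starts[i], starts[i+1]).
    """
    starts, seg_ids = [], []
    for pos, adapter in enumerate(token_adapter_ids):
        # Open a new segment whenever the adapter changes
        if not seg_ids or seg_ids[-1] != adapter:
            starts.append(pos)
            seg_ids.append(adapter)
    starts.append(len(token_adapter_ids))
    return starts, seg_ids

# Three requests batched together: adapter 0, adapter 1, base model (None)
flat = [0, 0, 0, 1, 1, None, None, None]
print(build_adapter_segments(flat))
# ([0, 3, 5, 8], [0, 1, None])
```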

This PR is a work in progress to add support for multiple loras to be loaded at startup; a request can then use 0 or 1 adapters by specifying the adapter id.
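The "0 or 1 adapters per request" routing can be pictured with a small sketch (class and method names are hypothetical; the real server loads adapter weights rather than just recording ids):

```python
class AdapterRegistry:
    """Toy registry of adapters loaded once at startup."""

    def __init__(self, adapter_ids):
        self._known = set(adapter_ids)

    def resolve(self, adapter_id=None):
        """Return the adapter to use for a request, or None for the base model."""
        if adapter_id is None:
            return None  # no adapter_id in the request -> base model
        if adapter_id not in self._known:
            raise ValueError(f"adapter {adapter_id!r} was not loaded at startup")
        return adapter_id

registry = AdapterRegistry(["predibase/customer_support", "predibase/dbpedia"])
print(registry.resolve())                              # None -> base model
print(registry.resolve("predibase/customer_support"))  # routed to that adapter
```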
Example usage
download adapter without auto merging
start server with multiple LoRA adapters
sending request without adapter id
curl 127.0.0.1:3000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "inputs": "What are 3 unique words that describe you?",
        "parameters": { "max_new_tokens": 40 }
    }'

{ "generated_text": "\n\nI’m a very passionate person. I’m very driven. I’m very determined.\n\nWhat is your favorite thing about being a teacher?\n\nI love the fact" }

with first LoRA adapter specified
curl 127.0.0.1:3000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "inputs": "What are 3 unique words that describe you?",
        "parameters": { "max_new_tokens": 40, "adapter_id": "predibase/customer_support" }
    }'

{ "generated_text": "\n\nI’m not sure if I can come up with 3 unique words that describe me, but I’ll try.\n\n1. Creative\n2. Funny\n3." }

with second LoRA adapter specified
curl 127.0.0.1:3000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "inputs": "You are given the title and the body of an article below. Please determine the type of the article.### Title: Great White Whale\n\n### Body: Great White Whale is the debut album by the Canadian rock band Secret and Whisper. The album was in the works for about a year and was released on February 12 2008.",
        "parameters": { "max_new_tokens": 40, "adapter_id": "predibase/dbpedia" }
    }'

{ "generated_text": "8" }